Are Activity Wrist-Worn Devices Accurate for Determining Heart Rate during Intense Exercise?

The market for wrist-worn devices is growing at unprecedented speed. A consequence of their fast commercialization is a lack of adequate studies testing their accuracy on varied populations and pursuits. To provide an understanding of wearable sensors for sports medicine, the present study examined heart rate (HR) measurements of four popular wrist-worn devices, the Fitbit Charge (FB), Apple Watch (AW), Tomtom Runner Cardio (TT), and Samsung G2 (G2), and compared them with gold standard measurements derived by continuous electrocardiogram examination (ECG). Eight athletes participated in a comparative study undergoing maximal stress testing on a cycle ergometer or a treadmill. We analyzed 1,286 simultaneous HR data pairs between the tested devices and the ECG. The four devices were reasonably accurate at the lowest activity level. However, at higher levels of exercise intensity the FB and G2 tended to underestimate HR values, while the TT and AW devices were fairly reliable. Our results suggest that HR estimations should be considered cautiously at specific intensities. Indeed, an effective intervention is required to register accurate HR readings at high-intensity levels (above 150 bpm). It is important to consider that even though none of these devices are certified or sold as medical or safety devices, researchers must nonetheless evaluate wrist-worn wearable technology in order to fully understand how HR affects psychological and physical health, especially under conditions of more intense exercise.


Introduction
Currently, wearable technology applied to biomedical data control in sports medicine is widespread among different types of users (e.g., athletes, patients, and people who practice sports or exercise), and is continuously growing [1][2][3][4][5][6][7][8][9][10][11][12][13]. Exercise as medicine is a global health initiative encouraging physicians and other healthcare professionals to include physical activity evaluation and training programs in every patient visit [14]. In this framework, wearable technology applied to health and fitness has been confirmed as a ubiquitous technology that helps to enhance performance and prevent injury [3,7,8,12,15,16]. Indeed, Google Trends 2020 reports with respect to the consumers' increasing interest in wearable devices indicate that they are probably attracted by a desire to improve their quality of life [17]. heart rate monitors, the accuracy of HR measurement in wrist wearables with PPG is uncertain [56,[62][63][64][65][66]. The accuracy of any wearable medical device is increasingly relevant, as it can influence both medical decisions and patient outcomes [15,19,42,43,60]. In cardiac patients as well as other patients, after hospital discharge [67] an integral validation of these devices is vital in light of the need to monitor the patient's cardiac rehabilitation through the recommended HR thresholds [68][69][70].
Furthermore, there are additional limitations for wrist-worn devices that could have negative effects. In steady-state aerobic exercises such as cycling, wrist-worn activity devices have proven to be reasonably accurate in estimating HR [63,71]. However, in non-steady exercises such as body weight lifting, CrossFit, and other high-intensity interval training (HIIT) exercises, accuracy is compromised [54], especially at HR frequencies higher than 150 bpm [72][73][74]. Furthermore, because these devices are typically worn on the nondominant wrist during these multiple forms of exercise, motion artifacts may introduce noise in the detected signal [27,[75][76][77]. Hence, for fitness professionals who supervise users to monitor their physical activity it is crucial to evaluate the accuracy of these devices.
Consequently, considering the growing popularity of wearable devices with consumers and researchers [4,5,8,10,56,57], queries about their reliability are becoming of paramount importance, along with the need to assess their validity and accuracy for health and recreation purposes. Unlike clinically approved devices, these consumer devices still require further research to evaluate them. Therefore, a deeper analysis of the data obtained through these new devices is urgent, as only 5% of these technologies have been formally validated [10,63,[78][79][80].
The current investigation aims to analyze the accuracy of four of the most popular wrist-worn devices, Fitbit Charge (FB), Apple Watch (AW), Tomtom Runner Cardio (TT), and Samsung G2 (G2), as a function of different intensity levels during a maximum stress test performed either on a treadmill or a cycle ergometer. A second goal of this study is to determine whether these devices are suitable as medical devices for assessing exercise safety or user health. The rest of this paper is structured as follows. Section 2 describes the selection of the participants in Section 2.1, the study settings in Section 2.2, the tested heart rate devices in Section 2.3, and the statistical analysis in Section 2.4. The results and data analysis are presented in Section 3 and discussed in Section 4. Finally, Section 5 provides our conclusions based on the main findings from this research.

Participant Selection
The inclusion criteria for subject selection were: athletes aged 18 to 55 performing regular practice of a competitive sport in national and regional tournaments for at least two years prior to the study and not having any current physical limitations, medical conditions, or psychiatric conditions. All subjects trained two to four times a week between 1 and 3 h/day. The volunteers maintained this sports practice until the day before the present study was carried out. Prior to admittance to the study, all subjects were evaluated for their cardiovascular health. None of the volunteers reported any respiratory or cardiac disease, presenting average spirometric values. Eight active and healthy athletes (87.5% male) who met the aforementioned inclusion criteria volunteered to participate in this study. The athletes performed an incremental exercise test on a treadmill or a cycle ergometer while wearing two devices at the same time, one on each wrist, while electrocardiography data were recorded.

Study Setting
The exercise tests were performed in the Physiology Laboratory at the Professional School of Sports Medicine of the Faculty of Medicine, Universidad Complutense de Madrid, Spain. In conformity with the review policy statement, the experimental protocol was approved by the Ethics Committee of the Hospital Clinico San Carlos, Madrid (HCSC) (no: 16/123-E) and conducted according to the Helsinki Declaration. All subjects provided their written informed consent to participate after the procedure and the study risks had been explained to them.
The maximal stress tests were carried out either on a treadmill ergometer with an incremental protocol (fixed slope of 1%) reaching a final speed of 16 km/h, or on a cycle ergometer with an incremental protocol of 25 W/min reaching a power that oscillated between 250 and 350 W, with all the athletes reaching HR frequencies higher than 150 bpm.
HR time series were extracted from the devices and compared with the HR data obtained from the electrocardiogram. During the athletes' preparation, twelve ECG electrodes were placed for the 12-lead ECG reading. The area was first prepared by shaving and alcohol sterilization to ensure a correct electrode position while wearing a tubular mesh top. Subsequently, blood pressure was taken to establish a baseline measurement and electrocardiographic readings were taken at rest in supine and standing positions. At the beginning of the test, the time and date were synchronized between the electrocardiogram and the wrist-worn devices. Parameter readings and measurements during the stress test were collected every 10 seconds. In addition, a mask was placed over the athlete's nose and mouth in order to prevent air leakage and properly analyze expired gases, as shown in Figure 1.

Figure 1.
Athlete performing the exercise stress test on the treadmill. The HR device is placed on the wrist, the electrodes are placed for electrocardiographic recording, and the mouthpiece is for the gas flow analyzer.

HR Measuring Devices
For this study, four of the most popular wrist-worn devices (Fitbit Charge (Fitbit, CA, USA), Apple Watch (Apple, CA, USA), Tomtom Runner Cardio (Tomtom, Amsterdam, The Netherlands), Gear S2 (Samsung, Suwon, Republic of Korea)) were selected to verify the HR accuracy because of their commercial availability, cost-effectiveness, light weight, and popularity. The accuracy of these four wrist devices during a maximal exercise stress test was evaluated using a Norav 12-lead exercise electrocardiographic (ECG) recorder as the gold standard. The participants wore two different devices, one on each wrist, and wore the ECG at the same time. The data extraction procedure and characteristics of each device are specified in the following subsections.

Tomtom Runner Cardio (TT)
Exercise data extraction was straightforward using the standard tcx format from Garmin. The data recording interval was one second.

FitBit (FB)
This device featured the option of exporting exercise data in the standard tcx format from Garmin. The data recording interval was five seconds.

Apple Watch (AW)
This device did not offer a data export function from a specific exercise session to a computer; extraction is only possible globally. The device produces a folder containing a compressed XML file with the complete measurement history (the file size can easily exceed 200 MB). Although the document's format is specific to the device's brand, a simple parser was written to select XML tags related to the pairs of data (time, HR frequency) during the stress test. The data recording interval was five seconds.
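The parser used in the study is not reproduced here. As a minimal sketch of the idea, the example below streams an export file and keeps only the (time, HR) pairs recorded within a test window. It assumes the export follows the Apple Health `export.xml` layout, in which each sample is a `Record` element carrying `type`, `startDate`, and `value` attributes; the record type name, the timestamp format, and the function name `extract_hr` are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime

HR_TYPE = "HKQuantityTypeIdentifierHeartRate"  # assumed record type name
DATE_FMT = "%Y-%m-%d %H:%M:%S %z"              # assumed timestamp layout


def extract_hr(source, start, end):
    """Return (time, bpm) pairs with start <= time <= end.

    `source` is a file path or a binary file object. The file is read
    incrementally because the full export can exceed 200 MB.
    """
    pairs = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record" and elem.get("type") == HR_TYPE:
            t = datetime.strptime(elem.get("startDate"), DATE_FMT)
            if start <= t <= end:
                pairs.append((t, float(elem.get("value"))))
        elem.clear()  # drop the element's contents once it has been read
    return pairs


# Hypothetical two-record export used only to exercise the function.
sample = (
    b'<HealthData>'
    b'<Record type="HKQuantityTypeIdentifierHeartRate" '
    b'startDate="2016-05-10 10:00:05 +0200" value="92"/>'
    b'<Record type="HKQuantityTypeIdentifierStepCount" '
    b'startDate="2016-05-10 10:00:05 +0200" value="12"/>'
    b'</HealthData>'
)
window_start = datetime.strptime("2016-05-10 10:00:00 +0200", DATE_FMT)
window_end = datetime.strptime("2016-05-10 10:30:00 +0200", DATE_FMT)
hr_pairs = extract_hr(io.BytesIO(sample), window_start, window_end)
```

Filtering by the test window during parsing avoids loading the complete measurement history into memory.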

Gear S2 (G2)
This device did not offer the possibility of extracting the data to a file that could be directly exported. The method for extracting the HR time series lay in producing screenshots of the S-Health application and exporting the data in numerical form using an ad hoc algorithm. The data recording interval was five seconds.

Statistical Analysis
Following the British Standards Institution 2019 standards for medical electrical equipment, the accuracy root mean square (A rms ) difference between the HR values of the device and the HR values of the ECG as a gold standard was calculated [81]. The A rms includes a combination of a standard systematic error or bias component and a random component, providing a single number as a measurement of both bias and precision. Therefore, the A rms statistic is usually evaluated when the overall accuracy of a device is tested [82].
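The A rms expression itself is not shown in the text. The sketch below assumes the usual definition, the square root of the mean squared difference between paired device and ECG readings, and illustrates how this single number combines bias and random error: the squared A rms equals the squared mean difference plus the (population) variance of the differences. The function names and sample values are hypothetical.

```python
import math


def a_rms(device, ecg):
    """Root mean square of the paired differences (device - ECG), in bpm."""
    diffs = [d - e for d, e in zip(device, ecg)]
    return math.sqrt(sum(x * x for x in diffs) / len(diffs))


def bias_and_spread(device, ecg):
    """Mean difference (bias) and population SD of the differences."""
    diffs = [d - e for d, e in zip(device, ecg)]
    n = len(diffs)
    bias = sum(diffs) / n
    spread = math.sqrt(sum((x - bias) ** 2 for x in diffs) / n)
    return bias, spread


# Hypothetical paired readings in bpm, for illustration only.
device_hr = [100, 120, 140]
ecg_hr = [102, 125, 150]

rms = a_rms(device_hr, ecg_hr)
bias, spread = bias_and_spread(device_hr, ecg_hr)
# rms**2 equals bias**2 + spread**2 up to floating-point rounding.
```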
Furthermore, the difference in accuracy between each device and the gold standard ECG was assessed by determining the intraclass correlation coefficients (ICC) and their 95% CI, then constructing Bland-Altman graphs. Mean overall scores were compared between the devices and ECG using Student's t-test for paired samples. For each test, the level of significance was set at 5%. Statistical analysis was performed using the SPSS 15.0 package. Quantitative variables are presented with their means and standard deviations (SD).
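The statistical analysis in the study was run in SPSS; the following is only an illustrative sketch of the quantities plotted in a Bland-Altman graph. It assumes that differences are taken as ECG minus device (so that a positive mean difference indicates underestimation by the device) and that the limits of agreement are the mean difference ± 1.96 sample SD. The function name and the sample values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(ecg, device):
    """Return (mean difference, lower limit, upper limit) in bpm.

    Differences are ECG - device; the limits of agreement are
    mean difference +/- 1.96 * sample standard deviation.
    """
    diffs = [e - d for e, d in zip(ecg, device)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired readings in bpm, for illustration only.
ecg_hr = [100, 120, 140, 160]
device_hr = [98, 119, 136, 150]
bias, lower, upper = bland_altman(ecg_hr, device_hr)
```

In the plot, each pair is drawn at its average HR on the x-axis and its difference on the y-axis, with horizontal lines at the three returned values.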
Lastly, the absolute percent errors (APE) with respect to ECG were calculated to construct ordered boxplots stratified by HR. The APE is provided by

$$\mathrm{APE} = \frac{\left| HR_{\mathrm{device}} - HR_{\mathrm{ECG}} \right|}{HR_{\mathrm{ECG}}} \times 100 \tag{2}$$

The Tukey outlier detection rule was used to find any extreme outliers in APE values for each device and metric combination [83]. Spearman's rank correlation (R) was calculated for each device, with the results provided in the Supplementary Material, Figure S2.
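As a hedged illustration of these two steps, the sketch below computes the APE of each pair as the absolute device-ECG difference divided by the ECG value, times 100, and then flags values outside Tukey's fences. The fence multiplier `k` (1.5 for ordinary outliers, 3 for extreme ones) and the quartile method used by `statistics.quantiles` are assumptions; the study does not state which variant was applied.

```python
from statistics import quantiles


def ape(device, ecg):
    """Absolute percent error of each device reading relative to the ECG."""
    return [abs(d - e) / e * 100 for d, e in zip(device, ecg)]


def tukey_outliers(values, k=1.5):
    """Values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical APE values (%), for illustration only.
errors = [1, 2, 2, 3, 3, 3, 4, 4, 50]
extreme = tukey_outliers(errors, k=3)  # k=3 marks "extreme" outliers
```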

Results
The tested devices, athletes' sports, and anthropometric characteristics such as age, size, weight, height, and body mass index (BMI) are presented in Table 1. Based on the BMI and VO2max, respectively, all eight volunteers (seven men and one woman) were classified as having normal weight and good physical fitness condition [84]. Figure 2 represents the HR measurements provided by the wristband for four athletes compared to the reference ECG data in the two different tests, treadmill (top panels, (a) and (b)) and cycle ergometer (bottom panels, (c) and (d)) at all exercise intensities. In all the graphs, the black line corresponds to the reference curve related to the ECG data, which is considered the benchmark of the test, while the blue and green lines correspond to the tested devices. The differences with the ECG data are marked by error bars. As can be seen, the HR rises steadily according to the athlete's increasing effort intensity during the test. The accuracy root-mean-square (A rms ) was calculated for each device. The TT and AW devices have the lowest value, while the G2 has the highest value, which indicates worse accuracy.

Figure 2.
Wristband measurements for four athletes compared to the reference ECG data in the two different tests: treadmill (top panels, (a,b)) and cycle ergometer (bottom panels, (c,d)). The black line corresponds to the ECG data, while the blue and green lines correspond to the tested devices. The error bars denote differences between the ECG data and the tested device.
All devices show high sensitivity to motion artifacts and fail to follow accurate HR when the athletes reach levels of maximum effort (higher HR). Motion artifacts are more perceptible in the treadmill test than in the cycle ergometer test due to the athlete's movements. However, a high variability exists between the devices under the same conditions, i.e., the same type of test (cycle ergometer or treadmill) and HR. A comparison of the performance of the devices for the eight volunteers and a deeper analysis of the Spearman's rank correlation (R) can be found in the Supplementary Materials, Figures S1 and S2.
All available data from each instrument were processed in Bland-Altman form in order to obtain a more global view of the devices' performance. Data were reduced to measurements every 10 s (the available ECG data rate) by averaging or interpolating the extracted data from the four devices. The results provided by the four tested devices at all exercise intensities are shown in Figure 3. The blue dots correspond to the data extracted from the cycle ergometer tests, while the red dots correspond to the tests on the treadmill. The green areas display the limits of agreement for each device, and the dashed green line corresponds to the mean difference. The A rms values are specified for each test and device. High variability and significant inaccuracies between the ECG and the device HR measurements at high exercise intensities were observed among the different tests. For this reason, we performed a detailed statistical study based on different HR ranges (less than 100 bpm, from 100 to 150 bpm, and greater than 150 bpm). Table 2 shows the differences in the paired means between the ECG and each device for the whole sample and stratified by these HR ranges. We analyzed 1,286 simultaneous HR data pairs between the four tested devices and the ECG used as the reference standard. There were a total of 321 pairs from the TT runner, 440 from the AW, 377 from the FB, and 148 from the G2. The mean (SD) of each device, mean (SD) of the difference between the tested device measurement and the reference standard, mean relative difference (SD; %), mean absolute difference (SD; %), correlation between the measures, and intraclass correlation coefficients (ICC) were determined, along with their 95% CIs.
Lastly, device boxplots stratified by HR intervals were constructed to assist with visualization and analysis. Following Equation (2), Figure 4 shows the absolute percent errors (APE) for each device. The box limits show the range of 50% of the data, with a center black line designating the median value. The lines extending from each box represent the range of the remaining data, with dots placed there to represent outlier values. The empty dots refer to the most distant outliers. The green boxes correspond to HR measurements below 100 bpm, the blue boxes correspond to HR measurements between 100 and 150 bpm, and the red boxes correspond to HR measurements greater than 150 bpm.
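The reduction of device data to the 10 s ECG rate can be sketched as follows. This is an illustration only: it assumes that timestamps are expressed in seconds from the synchronized start of the test, that averaged values are labeled with the start of their 10 s window, and that interpolation is linear. None of these details is specified in the text, and the function names are hypothetical.

```python
import bisect


def bin_average(times, hrs, step=10.0):
    """Average all samples falling in each [k*step, (k+1)*step) window.

    Returns a dict mapping window start time (s) to mean HR (bpm).
    """
    bins = {}
    for t, hr in zip(times, hrs):
        bins.setdefault(int(t // step), []).append(hr)
    return {k * step: sum(v) / len(v) for k, v in sorted(bins.items())}


def interpolate_at(times, hrs, t):
    """Linearly interpolate HR at time t; `times` must be sorted ascending."""
    if t <= times[0]:
        return hrs[0]
    if t >= times[-1]:
        return hrs[-1]
    i = bisect.bisect_right(times, t)
    t0, t1 = times[i - 1], times[i]
    w = (t - t0) / (t1 - t0)
    return hrs[i - 1] + w * (hrs[i] - hrs[i - 1])


# Hypothetical 5 s device samples, for illustration only.
averaged = bin_average([0, 5, 10, 15], [100, 102, 110, 114])
midpoint = interpolate_at([0, 5, 10], [100, 110, 120], 7.5)
```

Averaging suits devices that record faster than the ECG rate, while interpolation can fill a 10 s grid point that falls between two device samples.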

Discussion
This current investigation examined how effectively four popular wrist-worn activity monitors (Fitbit Charge, Apple Watch, Tomtom Runner Cardio, and Gear S2) estimated HR throughout a maximal test performed either on a treadmill or a cycle ergometer. We found reasonable accuracy in HR estimation for two of these devices (AW and TT), especially at lower-intensity exercises, which is consistent with earlier studies [31,43,62,69,85]. Our findings indicate a positive difference in averages between the ECG and each device. Therefore, the tested devices tend to underestimate HR relative to the ECG, which is more noticeable in the case of the lower accuracy devices, namely, the FB and the G2. These results concur with other previous studies that examined HR estimations across various devices [63,71], although these studies only carried out monitoring after HR had stabilized under steady-state settings. According to another assessment, HR readings are typically more accurate on a cycle ergometer or at rest than on a treadmill [86]. Indeed, exercises involving an unstable wrist, such as those performed on elliptical machines or when walking or running, provide less accurate HR readings than exercises involving a stable wrist, such as the cycle ergometer [87,88].
Although the lowest A rms values when measuring HR were observed for the TT and the AW, as shown in panels (a) and (b) of Figure 3, significant variability between the different tests can be observed. Motion artifacts such as oscillation or arm movements are more visible for HR frequencies above 130 bpm. Indeed, in the case of AW, considerable differences appear for the treadmill test due to motion artifacts (see the red dots). On the contrary, for FB the appearance of a large cluster of blue points in the top part of panel (c) indicates that the A rms is higher for the cycle ergometer than the treadmill. Similar behavior was observed in a comparative study of the Fitbit Charge 2 and Garmin Vivosmart HR. Here, a significantly lower relative error was found for activities with repetitive motion of the upper torso compared to activities with no repetitive motion of the upper torso, such as the cycle ergometer test [54]. These unexpected differences could be a consequence of the configuration of the cycle ergometer, which has a magnetic brake that can interfere with non-shielded devices [89,90]. The G2 showed a remarkably high bias (mean difference) at almost all levels of exercise intensity (see panel (d) of Figure 3), which is in agreement with the reported results of a previous study [63]. In addition, motion artifacts present one of the most challenging problems for HR estimations under extreme activity settings [91,92]. Prior studies have revealed a lack of accuracy of PPG sensors when determining HR during activities involving significant physical exertion or repetitive contractions of the muscles [27,52,87,93,94], particularly above 150 bpm [74,95]. PPG signals can be obstructed, resulting in poor data quality due to a reduction in contact between the device's PPG sensor and the skin during activities involving prolonged muscle contractions or more intense workouts [27,94].
In addition, according to Parak et al., the type of sensor and the position in which the device is worn are significant factors determining the accuracy [96].
Within the last decades, effective algorithms to improve the quality of the signal in the presence of motion artifacts for exercises performed above 150 bpm have been developed [87,91,[97][98][99][100], for instance, by processing context information using additional on-body sensors and light sources [101,102], adaptive noise cancellation using accelerometers as a noise reference [103], adaptive noise cancellation using an integrated PPG sensor [104], deep learning methods [105], and techniques based on spectral analysis, such as traditional fast Fourier transform (FFT) [27].
The majority of previous studies carried out on the determination of HR on wrist-worn devices have shown limited accuracy [85,[106][107][108], typically with measurements that may be somewhat understated [76,[109][110][111]. A number of studies have attempted to correlate wearable HR measurements by PPG with those from a reference ECG signal as a gold standard [20,51,85,93,109]. Indeed, Boudreaux et al. simultaneously determined the accuracy of eight wearable devices, six wrist-worn, one chest-worn, and one ear-worn, during a graded cycling exercise test and during a structured resistance exercise regimen [111]. They found a significant underestimation of HR as exercise intensity increased across all devices; however, none of these studies analyzed accuracy according to HR stratification.
A study of FB and AW for very light, light, moderate, vigorous, and very vigorous intensities based on ECG-measured HR showed that the AW had the lowest relative error rate compared to FB at all exercise intensities; however, the accuracy of both devices was reduced as exercise intensity increased [112]. A study evaluating the accuracy of the Polar M600 optical heart rate monitor during various physical activities reported the highest HR percent accuracy during cycle intervals and the lowest during circuit weight training. In addition, there was a tendency toward HR underestimation as intensity increased and toward overestimation when intensity decreased. The accuracy was higher during periods of steady-state cycling, walking, jogging, and running, though less accurate as intensity increased [113].
In this context, the authors of [62] tested the accuracy of six types of wearable devices on 50 volunteers walking or running on a treadmill. In their study, the TT showed the best accuracy while the FB showed the poorest, which is similar to the results we report here. According to a recent systematic review of studies of various models of FB devices against reference measures of energy expenditure, heart rate, and steps, FB devices are likely to underestimate heart rate. However, this underestimation can, on average, be acceptable for steps and heart rate [114].
Moreover, under conditions of intense physical activity the accuracy of heart rate measurement is significantly decreased [64,110,115]. Studies have yet to critically compare devices with a gold standard method approved by the FDA. Indeed, the manufacturers have yet to propose a robust validation system for these devices [61,63,68]. The type of exercise and the conditions in which the exercise is performed influence the reproducibility of HR measurement by these devices [74,116,117]. Although there are many studies about accuracy in the determination of HR in wrist-worn devices during rest [118,119], as well as in different physical activities (sitting, standing, walking slowly, walking fast, running, cycling, etc.) [49,63,64,74,76,85,106,110,[120][121][122][123][124][125][126][127][128][129][130][131], to the best of our knowledge there has been no research into the reliability of these specific devices focused on heart rate ranges compared to conventional ECG results.
In this framework, one of our goals was to determine a threshold HR value at which motion artifacts become significant. Our analysis based on stratified HR ranges (Table 2) shows that the TT and AW devices present insignificant differences in means relative to the ECG in the <100 bpm interval, with ICC = 0.683 and ICC = 0.799, respectively. Between 100 and 150 bpm, they show a higher degree of accuracy, with good or excellent agreement (ICC = 0.938 and ICC = 0.930, respectively). It is somewhat surprising that the ICC is lower for HR < 100 bpm than in the 100-150 bpm interval. This fact could be attributed to distortion of the PPG signal or delays in the device response time to HR variations. Indeed, as Iyriboz et al. assumed, during heavy exercise for HR > 155 bpm the oscillations of the pulse pressure waveform are distorted in a way that interferes with the PPG signal [74]. Figure 3 shows that the highest differences for TT start at 128 bpm for the treadmill test, with an A rms = 9.39, while for the cycle ergometer test the outlier values appear at HRs greater than 150 bpm, with an A rms = 5.85. In the case of the AW, although there are isolated dots for HR between 100 and 120 bpm for both tests, the A rms = 18.34 for the treadmill test, while for the cycle ergometer we obtain a low A rms = 4.47. In addition, as the intensity increases in the HR ≥ 150 bpm range the level of accuracy is moderate, with ICC = 0.528 and ICC = 0.729, respectively, maintaining minor differences in the means relative to the ECG. These results corroborate those of Boudreaux et al. [111] and Wang et al. [69], who reported that the AW was highly accurate in measuring heart rate during graded exercise cycling and various aerobic activities, respectively.
Furthermore, for the FB and G2 the differences between the HR averages and the ECG data increase significantly as the HR increases, showing a poor level of accuracy, especially for HR ≥ 150 bpm, for which the ICCs are 0.019 and −0.024, both with p values less than 0.001. Panel (c) of Figure 3 shows a high A rms = 25.25 for the FB device in the cycle ergometer test at frequencies between 100 and 150 bpm, with a low ICC = 0.148. In contrast, a lower A rms = 10.70 was estimated for the treadmill test. Both of these results are consistent with the study by Reddy et al. [54]. Lastly, the G2 provides inaccurate measurements, with ICC < 0.005 in the three studied HR ranges. This low performance may have resulted from improper development of the HR determination algorithms. Figure 4 shows a box plot representing the APE for each of the four devices we evaluated stratified by HR intervals. As can be observed, the TT and the AW display lower APE values for HRs below 100 bpm and between 100 and 150 bpm, while their accuracy is reasonable for frequencies above 150 bpm. The FB device offers an acceptable level of accuracy for HR values below 100 bpm and between 100 and 150 bpm; however, the precision is unreliable for frequencies above 150 bpm. The G2 has high APE values across all three examined ranges. Figures S1 and S2 of the Supplementary Material show the threshold values at which motion artifacts become perceptible for all four devices.
To the best of our knowledge, this study is the first to assess the reliability of these particular wrist-worn wearable devices (FB, TT, AW, G2) based on various HR ranges while examining the impact of exercise intensity. Our findings demonstrate that HR accuracy is markedly compromised across all devices as exercise intensity increases. Therefore, our stratified and correlated study should be taken into account when prescribing exercise, especially for people with underlying heart disease.

Main Implications and Future Perspectives
The aim of this study was to assess the accuracy of the chosen wearables as well as to determine whether these devices are suitable as medical devices to assess exercise safety or user health. It is important to increase the precision of the equipment that measures medical parameters in order to assist athletes and patients with heart disease and lower the risk of harm when exercising. It is essential to keep in mind that even if none of these devices is certified or sold as a medical or safety device, their use is widespread within the population, particularly in occasional and non-professional athletes [1,14,21]. In addition, not all wrist activity monitors are made to exact requirements. A number of them have demonstrated unreliable accuracy when used for various activities and exercise settings, including extreme physical activity [10,23,63,[78][79][80]. It is worth mentioning that this behavior can result in inaccurate or underestimated readings of the relevant physical effort level, which can cause harmful behavior for unaware users.
Indeed, in order to fully understand how HR affects psychological and physical health, future research to evaluate wrist-worn wearable technology is needed in order to maximize the usefulness of new technologies, clarify the accuracy of physiological data under conditions of more intense exercise, and clearly resolve researchers' claims to satisfy the FDA-approved gold standard. Regulatory standards must be prepared to ensure the process of accurate evidence accumulation. Considering that companies rarely fully validate new wearable models, it is important to use caution when comparing our findings to earlier models, as it is unknown whether the sensors or algorithms have changed.

Strengths and Limitations of This Study
This research has the following strengths over prior studies. First, this study simultaneously assessed the HR accuracy of four popular wrist-worn activity monitors using the most recent estimation techniques. For the calibration and validity of wrist-worn activity monitoring devices, the use of "unit calibration" allows the signals to be appropriately monitored. In these studies, it is imperative to evaluate different parameters such as exercise intensity [129,130], different skin pigmentation [132], sex, age range, and fitness condition [133]. Second, we analyzed the absolute percent error for different HR ranges in order to determine the frequency range at which motion artifacts become noticeable. In addition, a stress test was performed on two different ergometers, namely, a cycle ergometer and a treadmill. Third, participants in this study received in-depth education and training, meaning that they were already familiar with how the devices worked. Fourth, utilizing an ECG as the gold standard was a sensible decision that prevented system error resulting from using instrument measurements. Lastly, for the analysis of the extracted data, we used a rigorous statistical methodology, the Bland-Altman method, which is considered the most appropriate statistical method for evaluating the measurement of biomedical variables [134]. Many investigations engage in inappropriate analysis [68] by using correlation coefficients [115]. This statistical methodology is suitable for studying different HR ranges and the variance according to these HR ranges [45].
The study does, however, have certain limitations. First, the monitoring data were collected from participants in a laboratory under controlled conditions; therefore, the outcomes might not accurately mirror those in real-world settings. The amount of random and incidental error resulting from measurements obtained in the subject's natural state of life can be significantly reduced by imposing specific constraints on the subject's activity settings. Second, because reduced PPG accuracy is linked to increased wearable movement [135], there are various significant parameters that have been demonstrated to affect HR accuracy, including wrist placement, wrist circumference, device tightness, dominant vs. non-dominant hand use, and degree of wrist movement [109,136]. Lastly, only one cross-sectional measurement was made for this study on seven male and one female volunteer athletes of different ages, BMIs, and VO2max with both light and dark skin. No additional longitudinal measures were made. The sample sizes for female participants (1) and those with dark skin (1) were not large enough to reach statistically significant conclusions, as described in other studies [120,132]. However, these limited sample sizes were partially offset by the simultaneous HR measurements made with a variety of wearable devices.

Conclusions
The main goal of the current study was to evaluate the performance and accuracy of four commercially available wrist-worn wearables for monitoring HR at various activity levels. Our results show that as exercise intensity increases there is a higher underestimation of HR across all devices. The FB and G2 have medical device features that do not meet the FDA-approved gold standard, and both are significantly incorrect above 150 bpm. Particularly significant is that in cardiac rehabilitation, where many of these devices are used, efficient intervention is needed to manage the intensity of physical exertion in order to produce accurate HR readings at high-intensity levels (above 150 bpm). On the contrary, the wrist-worn wearables developed by Apple and TT demonstrate the highest validity for monitoring HR during a physical activity at different levels. Therefore, the validity of exercise recommendations based on the heart rates measured by these devices is acceptable. However, because these devices are frequently used to collect physiological data in long-term medical investigations, more research must focus on varied populations and pursuits to validate these findings. Furthermore, manufacturers might find this comparison helpful in determining the general applicability of measurements provided by various vendors.

Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bioengineering10020254/s1, Figure S1: Wristband measurements for the eight athletes compared to the reference ECG data in the two different tests: (a,b) treadmill, and (c-h) cycle ergometer. Figure S2: Scatter plots showing simultaneous HR measurements from ECG (x-axis) compared with each device (y-axis) in the different tests. The Spearman's rank correlation (R) is shown for each case.

Funding: This research was funded by Higher Sports Council, Ministry of Culture and Sport, FUNDER grant number (01/UPB10/07) "Estudio de la saturación de oxígeno, en deportistas mujeres de raza negra, durante la realización de pruebas de esfuerzo máximas" and FUNDER grant number: (01/UPB10/08) "Análisis y modelado de la relación entre el Electrocardiograma, fotopletismografía laser y parámetros ventilatorios durante la ejecución de pruebas de esfuerzo máximas en deportistas de ambos sexos".

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board (or Ethics Committee) of Hospital Clinico San Carlos (HCSC) no: 16/123-E.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Not applicable.

Acknowledgments:
The authors would like to express their thanks to all of the athletes who participated in the study, and to all the physicians from the Department of Sports Medicine and personnel from the Professional School of Sports Medicine. A.M.C and P.M. are thankful for support from ANID Project SA22I0178. A.M.C is thankful for support from UTA-Project 4722-22 and ANID Project SA77210039.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: